1 Introduction and historical background

This article focuses on an important piece of work of the world-renowned
Indian statistician, Calyampudi Radhakrishna Rao. In 1945, C. R. Rao (25
years old then) published a pathbreaking paper [13], which had a profound
impact on subsequent statistical research. Roughly speaking, Rao obtained a
lower bound to the variance of an estimator. The importance of this work can
be gauged, for instance, by the fact that it has been reprinted in the volume
Breakthroughs in Statistics: Foundations and Basic Theory [32]. There have
been two major impacts of this work:

• First, it answers a fundamental question statisticians have always been 
interested in, namely, how good can a statistical estimator be? Is there 
a fundamental limit when estimating statistical parameters? 

• Second, it opens up a novel paradigm by introducing differential
geometric modeling ideas to the field of Statistics. In recent years, this
contribution has led to the birth of a flourishing field of Information
Geometry [5].



It is interesting to note that H. Cramér [20] (1893-1985) also dealt with
the same problem in his classic book Mathematical Methods of Statistics,
published in 1946, more or less at the same time Rao's work was published.
The result is widely acknowledged nowadays as the Cramér-Rao lower bound
(CRLB). The lower bound was also reported independently^1 in the works of
M. Fréchet [27] (uniparameter case) and G. Darmois [22] (multi-parameter
case). The works of Fréchet and Darmois were both published in French, which
somewhat limited their international scientific exposure. Thus the lower
bound is also sometimes called the Cramér-Rao-Fréchet-Darmois lower bound.

This review article is organized as follows: Section [2] introduces the two
fundamental contributions in C. R. Rao's paper:

• The Cramér-Rao lower bound (CRLB), and

• The Fisher-Rao Riemannian geometry.

Section [3] concisely explains how information geometry has since evolved
into a full-fledged discipline. Finally, Section [4] concludes this review by
discussing further perspectives of information geometry and hinting at
future challenges.



2 Two key contributions to Statistics 

To begin with, we describe the two key contributions of Rao [13], namely a 
lower bound to the variance of an estimator and Rao's Riemannian informa- 
tion geometry. 

1 The author thanks F. Barbaresco for bringing the historical references to his attention.

2.1 Rao's lower bound for statistical estimators

For a fixed integer $n \geq 2$, let $\{X_1, \ldots, X_n\}$ be a random sample of size $n$
on a random variable $X$ which has a probability density function (pdf) (or,
probability mass function (pmf)) $p(x)$. Suppose the unknown distribution
$p(x)$ belongs to a parameterized family $\mathcal{F}$ of distributions

$$\mathcal{F} = \{p_\theta(x) \mid \theta \in \Theta\},$$

where $\theta$ is a parameter vector belonging to the parameter space $\Theta$. For
example, $\mathcal{F}$ can be chosen as the family $\mathcal{F}_{\mathrm{Gaussian}}$ of all normal distributions
with parameters $\theta = (\mu, \sigma)$ (with $\theta \in \Theta = \mathbb{R} \times \mathbb{R}^+$):

$$p_\theta(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

The unknown distribution $p(x) = p_{\theta^*}(x) \in \mathcal{F}$ is identified by a unique
parameter $\theta^* \in \Theta$. One of the major problems in Statistics is to build an
"estimator" of $\theta^*$ on the basis of the sample observations $\{X_1, \ldots, X_n\}$.

There are various estimation procedures available in the literature, e.g.,
the method of moments and the method of maximum likelihood; for a more
comprehensive account of estimation theory, see e.g., [33]. From a given
sample of fixed size n, one can get several estimators of the same parameter.
A natural question then is: which estimator should one use, and how do their
performances compare to each other? This is precisely the subject of C. R.
Rao's first contribution in his seminal paper [13]. Rao addresses the following
question:

What is the accuracy attainable in the estimation of statistical parameters?

Before proceeding further, it is important to make some observations on
the notion of likelihood, introduced by Sir R. A. Fisher [26]. Let $\{X_1, \ldots, X_n\}$
be a random vector with pdf (or, pmf) $p_\theta(x_1, \ldots, x_n)$, $\theta \in \Theta$, where for
$1 \leq i \leq n$, $x_i$ is a realization of $X_i$. The function

$$L(\theta; x_1, \ldots, x_n) = p_\theta(x_1, \ldots, x_n),$$

considered as a function of $\theta$, is called the likelihood function. If $X_1, \ldots, X_n$
are independent and identically distributed random variables with pdf (or,
pmf) $p_\theta(x)$ (for instance, if $X_1, \ldots, X_n$ is a random sample from $p_\theta(x)$), the
likelihood function is

$$L(\theta; x_1, \ldots, x_n) = \prod_{i=1}^{n} p_\theta(x_i).$$

The method of maximum likelihood estimation consists of choosing an
estimator of $\theta$, say $\hat\theta$, that maximizes $L(\theta; x_1, \ldots, x_n)$. If such a $\hat\theta$ exists, we
call it a maximum likelihood estimator (MLE) of $\theta$. Maximizing the likelihood
function is mathematically equivalent to maximizing the log-likelihood
function since the logarithm function is a strictly increasing function. The
log-likelihood function is usually simpler to optimize. We shall write $l(x, \theta)$
to denote the log-likelihood function with $x = (x_1, \ldots, x_n)$.

Finally, we recall the definition of an unbiased estimator. Let $\{p_\theta, \theta \in \Theta\}$
be a set of probability distribution functions. An estimator $T$ is said to be an
unbiased estimator of $\theta$ if the expectation of $T$ satisfies

$$E_\theta(T) = \theta, \quad \text{for all } \theta \in \Theta.$$
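
The maximum likelihood recipe described above can be sketched numerically. The following minimal Python example assumes a Poisson model and a small made-up sample; the helper names (`poisson_log_likelihood`, `maximize_concave`) and the search interval are illustrative choices, not part of the original text. It maximizes the log-likelihood with a golden-section search and compares the result with the sample mean, which is the known closed-form MLE for this model.

```python
import math

def poisson_log_likelihood(lam, sample):
    """Log-likelihood of an IID Poisson sample for the parameter lam."""
    return sum(-lam + x * math.log(lam) - math.lgamma(x + 1) for x in sample)

def maximize_concave(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximizer of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            a = c  # the maximizer lies in [c, b]
        else:
            b = d  # the maximizer lies in [a, d]
    return (a + b) / 2

sample = [2, 4, 3, 5, 1]  # hypothetical observations
lam_hat = maximize_concave(
    lambda lam: poisson_log_likelihood(lam, sample), 1e-6, 20.0
)
sample_mean = sum(sample) / len(sample)
print(lam_hat, sample_mean)
```

Working with the log-likelihood rather than the likelihood keeps the objective a sum of simple terms, which is the practical reason given in the text for preferring it.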

Consider probability distributions with pdf (or, pmf) satisfying the fol- 
lowing regularity conditions: 

• The support $\{x \mid p_\theta(x) > 0\}$ is identical for all distributions (and thus
does not depend on $\theta$),

• $\int p_\theta(x)\,dx$ can be differentiated under the integral sign with respect to
$\theta$,

• The gradient $\nabla_\theta p_\theta(x)$ exists.

We are now ready to state C. R. Rao's fundamental limit of estimators. 

2.1.1 Rao's lower bound: Single parameter case 

Let us first consider the case of uni-parameter distributions like Poisson
distributions with mean parameter $\lambda$. These families are also called order-1
families of probabilities. The C. R. Rao lower bound in the case of
uni-parameter distributions can be stated now.



Theorem 1 (Rao lower bound (RLB)) Suppose the regularity conditions
stated above hold. Then the variance of any unbiased estimator $\hat\theta$,
based on an independent and identically distributed (IID) random sample of
size $n$, is bounded below by $\frac{1}{n I(\theta^*)}$, where $I(\theta)$ denotes the Fisher information
in a single observation, defined as

$$I(\theta) = -E_\theta\left[\frac{d^2 l(x;\theta)}{d\theta^2}\right] = -\int \frac{d^2 l(x;\theta)}{d\theta^2}\, p_\theta(x)\,dx.$$



As an illustration, consider the family of Poisson distributions with
parameter $\theta = \lambda$. One can check that the regularity conditions hold. For a
Poisson distribution with parameter $\lambda$, $l(x;\lambda) = -\lambda + \log\frac{\lambda^x}{x!}$ and thus,

$$l'(x;\lambda) = -1 + \frac{x}{\lambda}, \qquad l''(x;\lambda) = -\frac{x}{\lambda^2}.$$

The first derivative is technically called the score function. It follows that

$$I(\lambda) = -E_\lambda\left[\frac{d^2 l(x;\lambda)}{d\lambda^2}\right] = \frac{1}{\lambda^2} E_\lambda[x] = \frac{1}{\lambda},$$

since $E[X] = \lambda$ for a random variable $X$ following a Poisson distribution
with parameter $\lambda$: $X \sim \mathrm{Poisson}(\lambda)$. What the RLB theorem states in plain
words is that for any unbiased estimator $\hat\lambda$ based on an IID sample of size $n$
of a Poisson distribution with parameter $\theta^* = \lambda^*$, the variance of $\hat\lambda$ cannot
go below

$$\frac{1}{n I(\lambda^*)} = \frac{\lambda^*}{n}.$$



The Fisher information, defined as the variance of the score, can be
geometrically interpreted as the curvature of the log-likelihood function. When
the curvature is low (the log-likelihood curve is almost flat), we may expect a
large deviation from the optimal $\theta^*$. But when the curvature is
high (peaky log-likelihood), we rather expect a small deviation
from $\theta^*$.
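
The two readings of the Fisher information used above (variance of the score, and negative expected curvature of the log-likelihood) can be compared numerically for the Poisson family. The sketch below is a minimal check under stated assumptions: the support is truncated at a hypothetical `x_max`, and the values of `lam_star` and `n` are made up for illustration.

```python
import math

def poisson_pmf(x, lam):
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

def fisher_information_poisson(lam, x_max=200):
    """Two numerical evaluations of I(lam), truncating the support at x_max."""
    var_score = 0.0      # E[(l'(x; lam))^2]; the score has zero mean
    neg_curvature = 0.0  # -E[l''(x; lam)]
    for x in range(x_max + 1):
        p = poisson_pmf(x, lam)
        score = -1.0 + x / lam
        var_score += p * score ** 2
        neg_curvature += p * (x / lam ** 2)
    return var_score, neg_curvature

lam_star, n = 3.0, 50  # hypothetical true parameter and sample size
info_a, info_b = fisher_information_poisson(lam_star)
rao_bound = 1.0 / (n * info_a)  # lower bound on the variance, lam_star / n
print(info_a, info_b, rao_bound)
```

Both evaluations should agree with $1/\lambda$, and the resulting bound with $\lambda^*/n$, which is also the variance of the sample mean of $n$ Poisson observations.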



2.1.2 Rao's lower bound: Multi-parameter case 

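For $d$-dimensional multi-parameter^2 distributions, the Fisher information
matrix $I(\theta)$ is defined as the symmetric matrix with the following entries [6]:

$$[I(\theta)]_{ij} = E_\theta\left[\frac{\partial}{\partial\theta_i}\log p_\theta(x)\,\frac{\partial}{\partial\theta_j}\log p_\theta(x)\right], \quad (1)$$

$$\phantom{[I(\theta)]_{ij}} = \int \left(\frac{\partial}{\partial\theta_i}\log p_\theta(x)\,\frac{\partial}{\partial\theta_j}\log p_\theta(x)\right) p_\theta(x)\,dx. \quad (2)$$

Provided certain regularity conditions are met (see [6], section 2.2), the
Fisher information matrix can be written equivalently as:

$$[I(\theta)]_{ij} = -E_\theta\left[\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log p_\theta(x)\right],$$

or as:

$$[I(\theta)]_{ij} = 4\int \frac{\partial}{\partial\theta_i}\sqrt{p_\theta(x)}\,\frac{\partial}{\partial\theta_j}\sqrt{p_\theta(x)}\,dx.$$

In the case of multi-parameter distributions, the lower bound on the
accuracy of unbiased estimators can be extended using the Löwner partial
ordering on matrices defined by $A \succeq B \Leftrightarrow A - B \succeq 0$, where $M \succeq 0$ means
$M$ is positive semidefinite [11] (we similarly write $M \succ 0$ to indicate that
$M$ is positive definite).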

The Fisher information matrix is always positive semi-definite [33]. It
can be shown that the Fisher information matrix of regular probability
distributions is positive definite, and therefore always invertible. Theorem 1 on
the lower bound on the inaccuracy extends to the multi-parameter setting as
follows:



2 Multi-parameter distributions can be univariate like the 1D Gaussians $N(\mu, \sigma)$ or
multivariate like the Dirichlet distributions or dD Gaussians.

Theorem 2 (Multi-parameter Rao lower bound (RLB)) Let $\theta$ be a
vector-valued parameter. Then for an unbiased estimator $\hat\theta$ of $\theta^*$ based on
an IID random sample of $n$ observations, one has $V[\hat\theta] \succeq n^{-1} I^{-1}(\theta^*)$, where
$V[\hat\theta]$ now denotes the variance-covariance matrix of $\hat\theta$ and $I^{-1}(\theta^*)$ denotes the
inverse of the Fisher information matrix evaluated at the optimal parameter
$\theta^*$.



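As an example, consider an IID random sample of size $n$ from a normal
population $N(\mu^*, \sigma^{*2})$, so that $\theta^* = (\mu^*, \sigma^{*2})$. One can then verify that the
Fisher information matrix of a normal distribution $N(\mu, \sigma^2)$ is given by

$$I(\theta) = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{bmatrix}.$$

Therefore,

$$V[\hat\theta] \succeq n^{-1} I^{-1}(\theta^*) = \begin{bmatrix} n^{-1}\sigma^{*2} & 0 \\ 0 & 2n^{-1}\sigma^{*4} \end{bmatrix}.$$
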

There has been a continuous flow of research along the lines of the CRLB, 
including the case where the Fisher information matrix is singular (positive 
semidefinite, e.g. in statistical mixture models). We refer the reader to the 
book of Watanabe [17] for a modern algebraic treatment of degeneracies in 
statistical learning theory. 
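
Returning to the regular (non-singular) normal example above, the Fisher information matrix in the $(\mu, \sigma^2)$ parameterization can be approximated as the expected outer product of the score. The sketch below uses a plain midpoint rule; the integration window (`half_width`), the number of steps, and the parameter values are arbitrary illustrative assumptions.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(x, mu, var):
    """Gradient of log N(x; mu, var) with respect to (mu, var)."""
    d_mu = (x - mu) / var
    d_var = -0.5 / var + (x - mu) ** 2 / (2 * var ** 2)
    return (d_mu, d_var)

def fisher_matrix_normal(mu, var, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of E[score score^T]."""
    sd = math.sqrt(var)
    lo = mu - half_width * sd
    h = 2 * half_width * sd / steps
    info = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(steps):
        x = lo + (k + 0.5) * h
        s = score(x, mu, var)
        w = normal_pdf(x, mu, var) * h
        for i in range(2):
            for j in range(2):
                info[i][j] += w * s[i] * s[j]
    return info

mu, var, n = 1.0, 4.0, 100  # hypothetical population and sample size
info = fisher_matrix_normal(mu, var)
# The matrix is diagonal here, so the bound n^{-1} I^{-1} is obtained entrywise.
bound = [[1.0 / (n * info[0][0]), 0.0], [0.0, 1.0 / (n * info[1][1])]]
print(info, bound)
```

The diagonal of `bound` should be close to $\sigma^{*2}/n$ and $2\sigma^{*4}/n$, the entries of the matrix lower bound in the example.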



2.2 Rao's Riemannian information geometry 

What further makes C. R. Rao's 1945 paper a truly impressive milestone 
in the development of Statistics is the introduction of differential geometric 
methods for modeling population spaces using the Fisher information matrix. 
Let us review the framework that literally opened up the field of information 
geometry [6]. 

Rao [33] introduced the notions of the Riemannian Fisher information
metric and geodesic distance to statisticians. This differential
geometrization of Statistics gave birth to what is now known as the field of
information geometry [5]. Although there were already precursor geometric
works [5S"t [121 EE] linking geometry to statistics by the Indian
community (Professors Mahalanobis and Bhattacharyya), none of them studied the
differential concepts or made the connection with the Fisher information
matrix. C. R. Rao is again a pioneer in offering statisticians the geometric
lens.



2.2.1 The population space 

Consider a family of parametric probability distributions $p_\theta(x)$ with $x \in \mathbb{R}^d$
and $\theta \in \mathbb{R}^D$ denoting the $D$-dimensional parameters of distributions (order
of the probability family). The population parameter space is the space

$$\Theta = \left\{ \theta \in \mathbb{R}^D \;\middle|\; \int p_\theta(x)\,dx = 1 \right\}.$$

A given distribution $p_\theta(x)$ is interpreted as a corresponding point indexed by
$\theta \in \mathbb{R}^D$. $\theta$ also encodes a coordinate system to identify probability models:
$\theta \leftrightarrow p_\theta(x)$.

Consider now two infinitesimally close points $\theta$ and $\theta + d\theta$. Their probability
densities differ by their first order differentials: $dp(\theta)$. The distribution
of $dp$ over all the support aggregates the consequences of replacing $\theta$ by
$\theta + d\theta$. Rao's revolutionary idea was to consider the relative discrepancy $\frac{dp}{p}$
and to take the variance of this difference distribution to define the following
quadratic differential form:

$$ds^2(\theta) = \sum_{i=1}^{D}\sum_{j=1}^{D} g_{ij}(\theta)\,d\theta_i\,d\theta_j = (\nabla\theta)^\top G(\theta)\,\nabla\theta,$$

with the matrix entries of $G(\theta) = [g_{ij}(\theta)]$ as

$$g_{ij}(\theta) = E_\theta\left[\frac{1}{p(\theta)}\frac{\partial p}{\partial\theta_i}\,\frac{1}{p(\theta)}\frac{\partial p}{\partial\theta_j}\right].$$

In differential geometry, we often use the symbol $\partial_i$ as a shortcut to $\frac{\partial}{\partial\theta_i}$.

The elements $g_{ij}(\theta)$ form the quadratic differential form defining the
elementary length of Riemannian geometry. The matrix $G(\theta) = [g_{ij}(\theta)] \succ 0$
is positive definite and turns out to be equivalent to the Fisher information
matrix: $G(\theta) = I(\theta)$. The information matrix is invariant to monotonous
transformations of the parameter space [42], which makes it a good candidate
for a Riemannian metric.

We shall discuss the concepts of invariance in statistical manifolds in more
detail later [TBI EB].

In [43], Rao proposed a novel versatile notion of statistical distance
induced by the Riemannian geometry beyond the traditional Mahalanobis
D-squared distance [35] and the Bhattacharyya distance [12]. The Mahalanobis
D-squared distance [35] of a vector $x$ to a group of vectors with covariance
matrix $\Sigma$ and mean $\mu$ is defined originally as

$$D^2_\Sigma(x, \mu) = (x - \mu)^\top \Sigma^{-1} (x - \mu).$$

The generic Mahalanobis distance $D_M(p, q) = \sqrt{(p - q)^\top M (p - q)}$ (with $M$
positive definite) generalizes the Euclidean distance ($M$ chosen as the identity
matrix).

The Bhattacharyya distance [12] between two distributions indexed by
parameters $\theta_1$ and $\theta_2$ is defined by

$$B(\theta_1, \theta_2) = -\log \int_{x \in \mathcal{X}} \sqrt{p_{\theta_1}(x)\,p_{\theta_2}(x)}\,dx.$$

Although the Mahalanobis distance $D_M$ is a metric (satisfying the triangle
inequality and symmetry), the symmetric Bhattacharyya distance fails the
triangle inequality. Nevertheless, it can be used to define the Hellinger metric
distance $H$ whose square is related to the Bhattacharyya distance as follows:

$$H^2(\theta_1, \theta_2) = \frac{1}{2}\int \left(\sqrt{p_{\theta_1}(x)} - \sqrt{p_{\theta_2}(x)}\right)^2 dx = 1 - e^{-B(\theta_1, \theta_2)} \leq 1. \quad (3)$$
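
The relation between the Bhattacharyya and Hellinger distances in eq. (3) can be checked on small discrete distributions, where the integrals become finite sums. This is a minimal sketch; the three probability vectors are made-up examples, and the discrete setting is an assumption made for the sake of a short, exact computation.

```python
import math

def bhattacharyya(p, q):
    """B(p, q) = -log sum_x sqrt(p(x) q(x)) for discrete distributions."""
    return -math.log(sum(math.sqrt(pi * qi) for pi, qi in zip(p, q)))

def hellinger_squared(p, q):
    """H^2(p, q) = 1/2 sum_x (sqrt(p(x)) - sqrt(q(x)))^2."""
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def hellinger(p, q):
    return math.sqrt(hellinger_squared(p, q))

# Hypothetical distributions on a three-letter alphabet.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
r = [0.1, 0.8, 0.1]

lhs = hellinger_squared(p, q)
rhs = 1.0 - math.exp(-bhattacharyya(p, q))
print(lhs, rhs)
print(hellinger(p, r) <= hellinger(p, q) + hellinger(q, r))
```

The last line illustrates, on one triple only, the triangle inequality that the Hellinger distance satisfies as a metric.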
2.2.2 Rao's distance: Riemannian distance between two populations

Let $P_1$ and $P_2$ be two points of the population space corresponding to the
distributions with respective parameters $\theta_1$ and $\theta_2$. In Riemannian
geometry, the geodesics are the shortest paths. For example, the geodesics on
the sphere are the arcs of great circles. The statistical distance between the
two populations is defined by integrating the infinitesimal element lengths $ds$
along the geodesic linking $P_1$ and $P_2$. Equipped with the Fisher information
matrix tensor $I(\theta)$, the Rao distance $D(\cdot,\cdot)$ between two distributions on a
statistical manifold can be calculated from the geodesic length as follows:

$$D(p_{\theta_1}(x), p_{\theta_2}(x)) = \min_{\substack{\theta(t) \\ \theta(0)=\theta_1,\ \theta(1)=\theta_2}} \int_0^1 \sqrt{(\nabla\theta)^\top I(\theta)\,\nabla\theta}\;dt.$$

Therefore we need to calculate explicitly the geodesic linking $p_{\theta_1}(x)$ to $p_{\theta_2}(x)$
to compute Rao's distance. This is done by solving the following second
order ordinary differential equation (ODE) [6]:

$$g_{ki}\,\ddot\theta_i + \Gamma_{k,ij}\,\dot\theta_i\,\dot\theta_j = 0,$$

where the Einstein summation convention [6] has been used to simplify the
mathematical writing by removing the leading sum symbols. The coefficients $\Gamma_{k,ij}$
are the Christoffel symbols of the first kind defined by:

$$\Gamma_{k,ij} = \frac{1}{2}\left(\frac{\partial g_{ik}}{\partial\theta_j} + \frac{\partial g_{kj}}{\partial\theta_i} - \frac{\partial g_{ij}}{\partial\theta_k}\right).$$

For a parametric statistical manifold with $D$ parameters, there are $D^3$
Christoffel symbols. In practice, it is difficult to explicitly compute the
geodesics of the Fisher-Rao geometry of arbitrary models, and one needs
to perform a gradient descent to find a local solution for the geodesics [41].
This is a drawback of Rao's distance as it has to be checked manually
whether the integral admits a closed-form expression or not.
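To give an example of the Rao distance, consider the smooth manifold of
univariate normal distributions, indexed by the $\theta = (\mu, \sigma)$ coordinate system.
The Fisher information matrix is

$$I(\theta) = \begin{bmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{2}{\sigma^2} \end{bmatrix} \succ 0. \quad (4)$$

The infinitesimal element length is:

$$ds^2 = (\nabla\theta)^\top I(\theta)\,\nabla\theta = \frac{d\mu^2}{\sigma^2} + \frac{2\,d\sigma^2}{\sigma^2}.$$

After the minimization of the path length integral, the Rao distance
between two normal distributions [13] [8] $\theta_1 = (\mu_1, \sigma_1)$ and $\theta_2 = (\mu_2, \sigma_2)$ is given
by:

$$D(\theta_1, \theta_2) = \begin{cases} \sqrt{2}\,\log\frac{\sigma_2}{\sigma_1} & \text{if } \mu_1 = \mu_2, \\ \frac{|\mu_1 - \mu_2|}{\sigma} & \text{if } \sigma_1 = \sigma_2 = \sigma, \\ \sqrt{2}\,\log\frac{\tan\frac{a_1}{2}}{\tan\frac{a_2}{2}} & \text{otherwise,} \end{cases}$$

where $a_1 = \arcsin\frac{\sigma_1}{b_{12}}$, $a_2 = \arcsin\frac{\sigma_2}{b_{12}}$ and

$$b_{12}^2 = \sigma_1^2 + \frac{(\mu_1 - \mu_2)^2 - 2(\sigma_2^2 - \sigma_1^2)}{8(\mu_1 - \mu_2)^2}.$$

For univariate normal distributions, Rao's distance amounts to computing
the hyperbolic distance for $\mathbb{H}(\frac{1}{\sqrt{2}})$, see [31].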
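
The link with hyperbolic geometry suggests a compact way to evaluate the distance. The sketch below does not transcribe the case-by-case expression above; it assumes the metric $ds^2 = d\mu^2/\sigma^2 + 2\,d\sigma^2/\sigma^2$ of eq. (4), rescales $\mu$ by $1/\sqrt{2}$ to land in the Poincaré upper half-plane, and uses the standard half-plane distance formula. The function name and test points are illustrative.

```python
import math

def rao_distance_normal(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao distance for ds^2 = dmu^2/sigma^2 + 2 dsigma^2/sigma^2.

    With x = mu / sqrt(2) and y = sigma, ds^2 = 2 (dx^2 + dy^2) / y^2, so the
    distance is sqrt(2) times the hyperbolic half-plane distance.
    """
    num = 0.5 * (mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2
    return math.sqrt(2) * math.acosh(1 + num / (2 * sigma1 * sigma2))

# Equal means: the value reduces to sqrt(2) * |log(sigma2 / sigma1)|.
d_same_mean = rao_distance_normal(0.0, 1.0, 0.0, math.e)

# Hypothetical points used to illustrate symmetry and the triangle inequality.
a, b, c = (0.0, 1.0), (1.0, 2.0), (3.0, 0.5)
d_ab = rao_distance_normal(*a, *b)
d_bc = rao_distance_normal(*b, *c)
d_ac = rao_distance_normal(*a, *c)
print(d_same_mean, d_ab, d_bc, d_ac)
```

In the equal-mean case this agrees with the first branch of the formula above; the other branches are not checked here.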

Statistical distances play a key role in tests of significance and
classification [42]. Rao's distance is a metric since it is a Riemannian geodesic
distance, and thus satisfies the triangle inequality. Rao's Riemannian geometric
modeling of the population space is now commonly called the Fisher-Rao
geometry [37]. One drawback of the Fisher-Rao geometry is the limited
computational tractability of Riemannian geodesics. The following section
concisely reviews the field of information geometry.

3 A brief overview of information geometry



Since the seminal work of Rao [6] in 1945, the interplay of differential
geometry with statistics has further strengthened and developed into a new
discipline called information geometry with a few dedicated monographs
[SJ HOI EES El SSI E]. It has been proved by Chentsov and published in his
Russian monograph in 1972 (translated in English in 1982 by the AMS [18]) that
the Fisher information matrix is the only invariant Riemannian metric for
statistical manifolds (up to some scalar factor). Furthermore, Chentsov [18]
proved that there exists a family of connections, termed the $\alpha$-connections,
that ensures statistical invariance.

3.1 Statistical invariance and f-divergences

A divergence is basically a smooth statistical distance that may be neither
symmetric nor satisfy the triangle inequality. We denote by $D(p : q)$ the
divergence from distribution $p(x)$ to distribution $q(x)$, where the ":" notation
emphasizes the fact that this dissimilarity measure may not be symmetric:
$D(p : q) \neq D(q : p)$.

It has been proved that the only statistical invariant divergences 0, H2]
are the Ali-Silvey-Csiszár f-divergences $D_f$ [1] [21] that are defined for a
functional convex generator $f$ satisfying $f(1) = f'(1) = 0$ and $f''(1) = 1$ by:

$$D_f(p : q) = \int_{x \in \mathcal{X}} p(x)\, f\left(\frac{q(x)}{p(x)}\right) dx.$$

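Indeed, under an invertible mapping function (with $\dim\mathcal{X} = \dim\mathcal{Y} = d$):

$$m : \mathcal{X} \to \mathcal{Y}, \qquad x \mapsto y = m(x),$$

a probability density $p(x)$ is converted into another density $q(y)$ such that:

$$p(x)\,dx = q(y)\,dy, \qquad dy = |M(x)|\,dx,$$

where $|M(x)|$ denotes the determinant of the Jacobian matrix [5] of the
transformation $m$ (i.e., the partial derivatives):

$$M(x) = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_d}{\partial x_1} & \cdots & \frac{\partial y_d}{\partial x_d} \end{bmatrix}.$$

It follows that

$$q(y) = q(m(x)) = p(x)\,|M(x)|^{-1}.$$
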

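For any two densities $p_1$ and $p_2$, we have the f-divergence on the transformed
densities $q_1$ and $q_2$ that can be rewritten mathematically as

$$D_f(q_1 : q_2) = \int_{y \in \mathcal{Y}} q_1(y)\, f\left(\frac{q_2(y)}{q_1(y)}\right) dy = \int_{x \in \mathcal{X}} p_1(x)\,|M(x)|^{-1} f\left(\frac{p_2(x)}{p_1(x)}\right) |M(x)|\,dx = D_f(p_1 : p_2).$$

Furthermore, the f-divergences are the only divergences satisfying the
remarkable data-processing theorem [23] that characterizes the property of
information monotonicity [3]. Consider discrete distributions on an alphabet
$\mathcal{X}$ of $d$ letters. For any partition $\mathcal{B} = \mathcal{X}_1 \cup \ldots \cup \mathcal{X}_b$ of $\mathcal{X}$ that merges alphabet
letters into $b \leq d$ bins, we have

$$0 \leq D_f(\bar{p}_1 : \bar{p}_2) \leq D_f(p_1 : p_2),$$

where $\bar{p}_1$ and $\bar{p}_2$ are the discrete distributions induced by the partition $\mathcal{B}$ on
$\mathcal{X}$. That is, we lose discrimination power by coarse-graining the support of
the distributions.

The most fundamental f-divergence is the Kullback-Leibler divergence
[19] obtained for the generator $f(x) = x \log x$:

$$\mathrm{KL}(p : q) = \int p(x) \log\frac{p(x)}{q(x)}\,dx.$$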
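
Information monotonicity can be illustrated on a small alphabet. The sketch below assumes the convention $D_f(p : q) = \sum_x p(x) f(q(x)/p(x))$ on discrete distributions and, as an additional assumption, uses the generator $f(u) = -\log u + u - 1$, which satisfies $f(1) = f'(1) = 0$, $f''(1) = 1$ and yields $\mathrm{KL}(p : q)$ under that argument order (with the opposite argument order, $u \log u$ plays the same role). The distributions and the partition are made up.

```python
import math

def f_divergence(p, q, f):
    """D_f(p : q) = sum_x p(x) f(q(x) / p(x)) for strictly positive p and q."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

def kl_generator(u):
    # Convex, with f(1) = f'(1) = 0 and f''(1) = 1.
    return -math.log(u) + u - 1

def coarse_grain(p, bins):
    """Merge alphabet letters according to a partition given as index lists."""
    return [sum(p[i] for i in block) for block in bins]

p1 = [0.1, 0.2, 0.3, 0.4]
p2 = [0.4, 0.3, 0.2, 0.1]
bins = [[0, 1], [2, 3]]  # hypothetical partition into b = 2 bins

fine = f_divergence(p1, p2, kl_generator)
coarse = f_divergence(coarse_grain(p1, bins), coarse_grain(p2, bins), kl_generator)
kl_direct = sum(a * math.log(a / b) for a, b in zip(p1, p2))
print(coarse, fine, kl_direct)
```

The coarse-grained value should not exceed the fine-grained one, and the fine-grained value should coincide with the directly computed Kullback-Leibler divergence.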
The Kullback-Leibler divergence between two distributions $p(x)$ and $q(x)$ is
equal to the cross-entropy $H^\times(p : q)$ minus the Shannon entropy $H(p)$:

$$\mathrm{KL}(p : q) = \int p(x) \log\frac{p(x)}{q(x)}\,dx = H^\times(p : q) - H(p),$$

with

$$H^\times(p : q) = \int -p(x) \log q(x)\,dx,$$

$$H(p) = \int -p(x) \log p(x)\,dx = H^\times(p : p).$$

The Kullback-Leibler divergence $\mathrm{KL}(\tilde{p} : p)$ [19] can be interpreted as the
distance between the estimated distribution $\tilde{p}$ (from the samples) and the
true hidden distribution $p$.
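
The decomposition of the Kullback-Leibler divergence into cross-entropy minus entropy is easy to verify on discrete distributions. The following sketch uses natural logarithms and two made-up probability vectors; it also shows on this example that the divergence is not symmetric.

```python
import math

def cross_entropy(p, q):
    """H^x(p : q) = -sum_x p(x) log q(x)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def entropy(p):
    """Shannon entropy H(p) = H^x(p : p)."""
    return cross_entropy(p, p)

def kl(p, q):
    """KL(p : q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

print(kl(p, q), cross_entropy(p, q) - entropy(p))
print(kl(p, q), kl(q, p))  # the two orientations differ
```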



3.2 Information and sufficiency 

In general, statistical invariance is characterized under Markov morphisms
[38] [12] (also called sufficient stochastic kernels [12]) that generalize the
deterministic transformations $y = m(x)$. Loosely speaking, a geometric
parametric statistical manifold $\mathcal{F} = \{p_\theta(x) \mid \theta \in \Theta\}$ equipped with an f-divergence
must also provide invariance by:

Non-singular parameter reparameterization. That is, if we choose a
different coordinate system, say $\theta' = f(\theta)$ for an invertible
transformation $f$, it should not impact the intrinsic distance between the
underlying distributions. For example, whether we parametrize the Gaussian
manifold by $\theta = (\mu, \sigma)$ or by $\theta' = (\mu^3, \sigma^2)$, it should preserve the
distance.

Sufficient statistic. When making statistical inference, we use statistics
$T : \mathbb{R}^d \to \Theta \subseteq \mathbb{R}^D$ (e.g., the mean statistic $T_n(X) = \frac{1}{n}\sum_{i=1}^{n} X_i$ is used
for estimating the parameter $\mu$ of Gaussians). In statistics, the concept
of sufficiency was introduced by Fisher:

"... the statistic chosen should summarize the whole of the relevant
information supplied by the sample."

Mathematically, the fact that all information should be aggregated
inside the sufficient statistic is written as

$$\Pr(x \mid t, \theta) = \Pr(x \mid t).$$

It is not surprising that all statistical information of a parametric
distribution with $D$ parameters can be recovered from a set of $D$
statistics. For example, the univariate Gaussian with $d = \dim\mathcal{X} = 1$ and
$D = \dim\Theta = 2$ (for parameters $\theta = (\mu, \sigma)$) is recovered from the mean
and variance statistics. A sufficient statistic is a set of statistics that
compresses information without loss for statistical inference.

3.3 Sufficiency and exponential families

The distributions admitting finite sufficient statistics are called the
exponential families [31] [HJ E], and have their probability density or mass functions
canonically rewritten as

$$p_\theta(x) = \exp\left(\theta^\top t(x) - F(\theta) + k(x)\right),$$

where $k(x)$ is an auxiliary carrier measure, $t(x) : \mathbb{R}^d \to \mathbb{R}^D$ is the sufficient
statistic, and $F : \mathbb{R}^D \to \mathbb{R}$ is a strictly convex and differentiable function,
called the cumulant function or the log normalizer since

$$F(\theta) = \log \int_{x \in \mathcal{X}} \exp\left(\theta^\top t(x) + k(x)\right) dx.$$

See [6] for canonical decompositions of usual distributions (Gaussian,
multinomial, etc.). The space $\Theta$ for which the log-integrals converge is called the
natural parameter space.
For example,

• Poisson distributions are univariate exponential families of order
1 (with $\mathcal{X} = \mathbb{N} = \{0, 1, 2, 3, \ldots\}$ and $\dim\Theta = 1$) with associated
probability mass function:

$$\Pr(X = k) = \frac{\lambda^k e^{-\lambda}}{k!},$$

for $k \in \mathbb{N}$.

The canonical exponential family decomposition yields

- $t(x) = x$: the sufficient statistic,

- $\theta = \log\lambda$: the natural parameter,

- $F(\theta) = \exp\theta$: the cumulant function,

- $k(x) = -\log x!$: the carrier measure.

• Univariate Gaussian distributions are distributions of order 2 (with
$\mathcal{X} = \mathbb{R}$, $\dim\mathcal{X} = 1$ and $\dim\Theta = 2$), characterized by two parameters
$\theta = (\mu, \sigma)$ with associated density:

$$p(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$

for $x \in \mathbb{R}$.

The canonical exponential family decomposition yields:

- $t(x) = (x, x^2)$: the sufficient statistic,

- $\theta = (\theta_1, \theta_2) = \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)$: the natural parameters,

- $F(\theta) = -\frac{\theta_1^2}{4\theta_2} + \frac{1}{2}\log\left(-\frac{\pi}{\theta_2}\right)$: the cumulant function,

- $k(x) = 0$: the carrier measure.

Exponential families provide a generic framework in Statistics, and are
universal density approximators [2]. That is, any distribution can be
approximated arbitrarily closely by an exponential family. An exponential family
is defined by the functions $t(\cdot)$ and $k(\cdot)$, and a member of it by a natural
parameter $\theta$. The cumulant function $F$ is evaluated by the log-Laplace
transform.
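
The two canonical decompositions listed above can be checked against the usual formulas for the Poisson mass function and the Gaussian density. The sketch below is illustrative only: `exp_family_density` and the way parameters are passed as tuples are conventions chosen here, not taken from the text.

```python
import math

def exp_family_density(x, theta, t, F, k):
    """p_theta(x) = exp(<theta, t(x)> - F(theta) + k(x))."""
    inner = sum(a * b for a, b in zip(theta, t(x)))
    return math.exp(inner - F(theta) + k(x))

def poisson_density(x, lam):
    # theta = log(lam), t(x) = x, F(theta) = exp(theta), k(x) = -log x!
    return exp_family_density(
        x,
        (math.log(lam),),
        t=lambda x: (x,),
        F=lambda th: math.exp(th[0]),
        k=lambda x: -math.lgamma(x + 1),
    )

def gaussian_density(x, mu, sigma):
    # theta = (mu / sigma^2, -1 / (2 sigma^2)), t(x) = (x, x^2), k(x) = 0
    theta = (mu / sigma ** 2, -1.0 / (2 * sigma ** 2))
    return exp_family_density(
        x,
        theta,
        t=lambda x: (x, x * x),
        F=lambda th: -th[0] ** 2 / (4 * th[1]) + 0.5 * math.log(-math.pi / th[1]),
        k=lambda x: 0.0,
    )

direct_poisson = 3.0 ** 2 * math.exp(-3.0) / math.factorial(2)
direct_gauss = math.exp(-(0.5 - 1.0) ** 2 / (2 * 2.0 ** 2)) / (2.0 * math.sqrt(2 * math.pi))
print(poisson_density(2, 3.0), direct_poisson)
print(gaussian_density(0.5, 1.0, 2.0), direct_gauss)
```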

To illustrate the generic behavior of exponential families in Statistics |H],
let us consider the maximum likelihood estimator for a distribution belonging
to the exponential family. We have the MLE $\hat\theta$:

$$\hat\theta = (\nabla F)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} t(x_i)\right),$$

where $(\nabla F)^{-1}$ denotes the reciprocal gradient of $F$: $(\nabla F)^{-1} \circ \nabla F = \nabla F \circ
(\nabla F)^{-1} = \mathrm{Id}$, the identity function on $\mathbb{R}^D$. The Fisher information matrix
of an exponential family is

$$I(\theta) = \nabla^2 F(\theta) \succ 0,$$

the Hessian of the log-normalizer, always positive-definite since $F$ is strictly
convex.

3.4 Dual Bregman divergences and α-Divergences

The Kullback-Leibler divergence between two distributions belonging to the
same exponential family can be expressed equivalently as a Bregman
divergence on the swapped natural parameters defined for the cumulant function
$F$ of the exponential family:

$$\mathrm{KL}(p_{F,\theta_1}(x) : p_{F,\theta_2}(x)) = B_F(\theta_2 : \theta_1) = F(\theta_2) - F(\theta_1) - (\theta_2 - \theta_1)^\top \nabla F(\theta_1).$$

As mentioned earlier, the ":" notation emphasizes that the distance is not a
metric: it satisfies neither symmetry nor the triangle inequality in
general. Divergence $B_F$ is called a Bregman divergence [13], and is the canonical
distance of dually flat spaces [6]. This correspondence between the Kullback-Leibler
divergence on densities and a divergence on parameters relies on the dual canonical
parameterization of exponential families [14]. A random variable $X \sim p_{F,\theta}(x)$, whose
distribution belongs to an exponential family, can be dually indexed by its
expectation parameter $\eta$ such that

$$\eta = E[t(X)] = \int_{x \in \mathcal{X}} t(x)\, e^{\theta^\top t(x) - F(\theta) + k(x)}\,dx = \nabla F(\theta).$$

For example, the $\eta$-parameterization of the Poisson distribution is: $\eta = \nabla F(\theta) =
e^\theta = \lambda = E[X]$ (since $t(x) = x$).
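
The identity between the Kullback-Leibler divergence and the Bregman divergence on swapped natural parameters can be checked for the Poisson family, where $F(\theta) = \exp\theta$. The sketch below computes the left-hand side by direct summation over a truncated support (the cut-off `x_max` and the two rates are illustrative assumptions) and the right-hand side from the cumulant function.

```python
import math

def bregman(F, grad_F, theta_a, theta_b):
    """Scalar Bregman divergence B_F(a : b) = F(a) - F(b) - (a - b) F'(b)."""
    return F(theta_a) - F(theta_b) - (theta_a - theta_b) * grad_F(theta_b)

def kl_poisson(lam1, lam2, x_max=200):
    """KL(Poisson(lam1) : Poisson(lam2)) by direct summation up to x_max."""
    total = 0.0
    for x in range(x_max + 1):
        log_p1 = -lam1 + x * math.log(lam1) - math.lgamma(x + 1)
        log_p2 = -lam2 + x * math.log(lam2) - math.lgamma(x + 1)
        total += math.exp(log_p1) * (log_p1 - log_p2)
    return total

lam1, lam2 = 2.0, 5.0  # hypothetical rates
theta1, theta2 = math.log(lam1), math.log(lam2)  # natural parameters

kl_value = kl_poisson(lam1, lam2)
# Note the swapped argument order: KL(p_theta1 : p_theta2) = B_F(theta2 : theta1).
bregman_value = bregman(math.exp, math.exp, theta2, theta1)
eta1 = math.exp(theta1)  # expectation parameter eta = F'(theta) = lambda
print(kl_value, bregman_value, eta1)
```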

In fact, the Legendre-Fenchel convex duality is at the heart of information
geometry: any strictly convex and differentiable function $F$ admits a dual
convex conjugate $F^*$ such that:

$$F^*(\eta) = \max_{\theta \in \Theta}\; \theta^\top \eta - F(\theta).$$

The maximum is attained for $\eta = \nabla F(\theta)$ and is unique since $F(\theta)$ is strictly
convex ($\nabla^2 F(\theta) \succ 0$). It follows that $\theta = (\nabla F)^{-1}(\eta)$, where $(\nabla F)^{-1}$ denotes
the functional inverse gradient. This implies that:

$$F^*(\eta) = \eta^\top (\nabla F)^{-1}(\eta) - F((\nabla F)^{-1}(\eta)).$$

The Legendre transformation is also called slope transformation since it maps
$\theta \mapsto \eta = \nabla F(\theta)$, where $\nabla F(\theta)$ is the gradient at $\theta$, visualized as the slope
of the support tangent plane of $F$ at $\theta$. The transformation is an involution
for strictly convex and differentiable functions: $(F^*)^* = F$. It follows that
the gradients of convex conjugates are reciprocal to each other: $\nabla F^* = (\nabla F)^{-1}$.
Legendre duality induces dual coordinate systems:

$$\eta = \nabla F(\theta),$$
$$\theta = \nabla F^*(\eta).$$

Furthermore, those dual coordinate systems are orthogonal to each other
since

$$\nabla^2 F(\theta)\,\nabla^2 F^*(\eta) = \mathrm{Id},$$

the identity matrix.
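
For the Poisson cumulant function $F(\theta) = \exp\theta$, the Legendre conjugate has the closed form $F^*(\eta) = \eta\log\eta - \eta$, which makes the duality relations above easy to verify. The sketch below is a one-dimensional check; the finite-difference step `h` and the test value of `theta` are arbitrary choices.

```python
import math

F = math.exp           # cumulant function of the Poisson family
grad_F = math.exp      # eta = F'(theta)
grad_F_inv = math.log  # theta = (F')^{-1}(eta)

def F_star(eta):
    """F*(eta) = eta * theta - F(theta), evaluated at theta = (F')^{-1}(eta)."""
    theta = grad_F_inv(eta)
    return eta * theta - F(theta)

def grad_F_star(eta, h=1e-6):
    """Central finite-difference approximation of (F*)'(eta)."""
    return (F_star(eta + h) - F_star(eta - h)) / (2 * h)

theta = 0.7
eta = grad_F(theta)

closed_form = eta * math.log(eta) - eta
recovered_theta = grad_F_star(eta)             # should be close to theta
hessian_product = math.exp(theta) * (1 / eta)  # F''(theta) * (F*)''(eta)
print(F_star(eta), closed_form, recovered_theta, hessian_product)
```

The value $\theta^\top\eta - F(\theta)$ at any other $\theta$ stays below $F^*(\eta)$, which is the maximization property in the definition of the conjugate.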
The Bregman divergence can also be rewritten in a canonical mixed
coordinate form $C_F$ or in the $\theta$- or $\eta$-coordinate systems as

$$B_F(\theta_2 : \theta_1) = F(\theta_2) + F^*(\eta_1) - \theta_2^\top \eta_1 = C_F(\theta_2, \eta_1) = C_{F^*}(\eta_1, \theta_2) = B_{F^*}(\eta_1 : \eta_2).$$

Another use of the Legendre duality is to interpret the log-density of an
exponential family as a dual Bregman divergence [9]:

$$\log p_{F,t,k,\theta}(x) = -B_{F^*}(t(x) : \eta) + F^*(t(x)) + k(x),$$

with $\eta = \nabla F(\theta)$ and $\theta = \nabla F^*(\eta)$.

The Kullback-Leibler divergence (an f-divergence) is a particular
divergence belonging to the 1-parameter family of divergences, called
$\alpha$-divergences (see [6], p. 57). The $\alpha$-divergences are defined for $\alpha \neq \pm 1$
as

$$D_\alpha(p : q) = \frac{4}{1 - \alpha^2}\left(1 - \int p(x)^{\frac{1-\alpha}{2}}\, q(x)^{\frac{1+\alpha}{2}}\,dx\right).$$

It follows that $D_\alpha(q : p) = D_{-\alpha}(p : q)$, and in the limit case, we have:

$$D_{-1}(p : q) = \mathrm{KL}(p : q) = \int p(x) \log\frac{p(x)}{q(x)}\,dx.$$

Divergence $D_1$ is also called the reverse Kullback-Leibler divergence, and
divergence $D_0$ is four times the squared Hellinger distance mentioned earlier
in eq. (3):

$$D_0(p : q) = D_0(q : p) = 4\left(1 - \int \sqrt{p(x)}\sqrt{q(x)}\,dx\right) = 4H^2(p, q).$$

In the sequel, we denote by $D$ the divergence $D_{-1}$ corresponding to the
Kullback-Leibler divergence.
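
The three properties of the $\alpha$-divergences stated above (the duality $D_\alpha(q : p) = D_{-\alpha}(p : q)$, the Hellinger case $\alpha = 0$, and the Kullback-Leibler limit $\alpha \to -1$) can be checked on discrete distributions. In this sketch the integral is replaced by a finite sum, the probability vectors are made up, and the limit is only approximated by taking $\alpha = -0.9999$.

```python
import math

def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) for alpha != +-1, on discrete distributions."""
    integral = sum(
        pi ** ((1 - alpha) / 2) * qi ** ((1 + alpha) / 2) for pi, qi in zip(p, q)
    )
    return 4.0 / (1 - alpha ** 2) * (1 - integral)

def hellinger_squared(p, q):
    return 0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

d0 = alpha_divergence(p, q, 0.0)
dual_left = alpha_divergence(q, p, 0.3)
dual_right = alpha_divergence(p, q, -0.3)
near_limit = alpha_divergence(p, q, -0.9999)
print(d0, 4 * hellinger_squared(p, q))
print(dual_left, dual_right)
print(near_limit, kl(p, q))
```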
3.5 Exponential geodesics and mixture geodesics

Information geometry as further pioneered by Amari [6] considers dual affine
geometries introduced by a pair of connections: the $\alpha$-connection and
$-\alpha$-connection instead of taking the Levi-Civita connection induced by the Fisher
information Riemannian metric of Rao. The $\pm 1$-connections give rise to
dually flat spaces p] equipped with the Kullback-Leibler divergence [19].
The case of $\alpha = -1$ denotes the mixture family, and the exponential family
is obtained for $\alpha = 1$. We omit technical details in this expository paper,
but refer the reader to the monograph [6] for details.

For our purpose, let us say that the geodesics are no longer defined as
shortest paths (as in the metric case of the Fisher-Rao geometry) but
rather as curves that ensure the parallel transport of vectors [6]. This defines
the notion of "straightness" of lines. Riemannian geodesics satisfy both the
straightness property and the minimum length requirement. Introducing
dual connections, we no longer have distances interpreted as curve
lengths, but geodesics defined by the notion of straightness only.

In information geometry, we have dual geodesics that are expressed for
the exponential family (induced by a convex function $F$) in the dual affine
coordinate systems $\theta/\eta$ for $\alpha = \pm 1$ as:

$$\gamma_{12} : L(\theta_1, \theta_2) = \{\theta = (1 - \lambda)\theta_1 + \lambda\theta_2 \mid \lambda \in [0, 1]\},$$

$$\gamma^*_{12} : L^*(\eta_1, \eta_2) = \{\eta = (1 - \lambda)\eta_1 + \lambda\eta_2 \mid \lambda \in [0, 1]\}.$$

Furthermore, there is a Pythagorean theorem that allows one to define
information-theoretic projections [6]. Consider three points $p$, $q$ and $r$ such
that $\gamma_{pq}$ is the $\theta$-geodesic linking $p$ to $q$, and $\gamma^*_{qr}$ is the $\eta$-geodesic linking $q$
to $r$. The geodesics are orthogonal at the intersection point $q$ if and only if
the Pythagorean relation is satisfied:

$$D(p : r) = D(p : q) + D(q : r).$$

In fact, a more general triangle relation (extending the law of cosines) exists:

$$D(p : q) + D(q : r) - D(p : r) = (\theta(p) - \theta(q))^\top (\eta(r) - \eta(q)).$$

Note that the $\theta$-geodesic $\gamma_{pq}$ and $\eta$-geodesic $\gamma^*_{qr}$ are orthogonal with respect
to the inner product $G(q)$ defined at $q$ (with $G(q) = I(q)$ being the Fisher
information matrix at $q$). Two vectors $u$ and $v$ in the tangent plane $T_q$ at $q$
are said to be orthogonal if and only if their inner product equals zero:

$$u \perp_q v \Leftrightarrow u^\top I(q)\, v = 0.$$

Observe that in any tangent plane $T_x$ of the manifold, the inner product
induces a squared Mahalanobis distance:

$$D_x(p, q) = (p - q)^\top I(x)(p - q).$$

Since $I(x) \succ 0$ is positive definite, we can apply the Cholesky decomposition
on the Fisher information matrix $I(x) = L(x)L^\top(x)$, where $L(x)$ is a lower
triangular matrix with strictly positive diagonal entries.

By mapping the points $p$ to $L^\top(x)\,p$ in the tangent space $T_x$, the
squared Mahalanobis distance amounts to computing the squared Euclidean distance
$D_E(p, q) = \|p - q\|^2$ in the tangent planes:

$$D_x(p, q) = (p - q)^\top I(x)(p - q),$$
$$\phantom{D_x(p, q)} = (p - q)^\top L(x)L^\top(x)(p - q),$$
$$\phantom{D_x(p, q)} = D_E(L^\top(x)\,p, L^\top(x)\,q).$$

It follows that after applying the "Cholesky transformation" of objects into
the tangent planes, we can solve geometric problems in tangent planes as one
usually does in Euclidean geometry.
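
The "Cholesky transformation" can be illustrated in two dimensions. In the sketch below, the matrix `info` stands in for a Fisher information matrix $I(x)$ at some point; its entries and the two points `p` and `q` are made-up values, and the hand-written 2x2 factorization is only meant to keep the example free of external libraries.

```python
import math

def cholesky_2x2(a):
    """Lower-triangular L with a = L L^T for a symmetric positive definite 2x2 matrix."""
    l11 = math.sqrt(a[0][0])
    l21 = a[1][0] / l11
    l22 = math.sqrt(a[1][1] - l21 ** 2)
    return [[l11, 0.0], [l21, l22]]

def apply_transpose(L, v):
    """Return L^T v for a lower-triangular 2x2 matrix L."""
    return [L[0][0] * v[0] + L[1][0] * v[1], L[1][1] * v[1]]

info = [[2.0, 0.5], [0.5, 1.0]]  # hypothetical positive definite matrix I(x)
p, q = [1.0, 2.0], [-0.5, 0.25]

d = [p[0] - q[0], p[1] - q[1]]
mahalanobis_sq = sum(d[i] * info[i][j] * d[j] for i in range(2) for j in range(2))

L = cholesky_2x2(info)
lp, lq = apply_transpose(L, p), apply_transpose(L, q)
euclidean_sq = (lp[0] - lq[0]) ** 2 + (lp[1] - lq[1]) ** 2
print(mahalanobis_sq, euclidean_sq)
```

Both quantities should coincide, which is the content of the chain of equalities above.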

Information geometry of dually flat spaces thus extends the traditional
self-dual Euclidean geometry, obtained for the convex function $F(x) = \frac{1}{2}x^\top x$
(and corresponding to the statistical manifold of isotropic Gaussians).

4 Conclusion and perspectives 

Rao's paper [43] has been instrumental for the development of modern
statistics. In this masterpiece, Rao introduced what is now commonly known as
the Cramér-Rao lower bound (CRLB) and the Fisher-Rao geometry. Both
contributions are related to the Fisher information, a concept due to
Sir R. A. Fisher, the father of mathematical statistics [26], who introduced
the concepts of consistency, efficiency and sufficiency of estimators. This
paper is undoubtedly recognized as the cornerstone for introducing
differential geometric methods in Statistics. This seminal work has inspired many
researchers and has evolved into the field of information geometry [6].
Geometry is originally the science of Earth measurements. But geometry is
also the science of invariance as advocated by Felix Klein's Erlangen program,
the science of intrinsic measurement analysis. This expository paper has
presented the two key contributions of C. R. Rao in his 1945 foundational
paper, and briefly presented information geometry without the burden of
differential geometry (e.g., vector fields, tensors, and connections).
Information geometry has now ramified far beyond its initial statistical scope, and is
further expanding prolifically in many different new horizons. To illustrate
the versatility of information geometry, let us mention a few research areas:

• Fisher-Rao Riemannian geometry [37],

• Amari's dual connection information geometry [6],

• Infinite-dimensional exponential families and Orlicz spaces [16],

• Finsler information geometry [45],

• Optimal transport geometry [28],

• Symplectic geometry, Kähler manifolds and Siegel domains [10],

• Geometry of proper scoring rules [25],

• Quantum information geometry.

Geometry with its own specialized language, where words like distances,
balls, geodesics, angles, orthogonal projections, etc., provides "thinking
tools" (affordances) to manipulate non-trivial mathematical objects and
notions. The richness of geometric concepts in information geometry helps one
to reinterpret, extend or design novel algorithms and data-structures by
enhancing creativity. For example, the traditional expectation-maximization
(EM) algorithm [22] often used in Statistics has been reinterpreted and
further extended using the framework of information-theoretic alternative
projections [3]. In machine learning, the famous boosting technique that learns
a strong classifier by linearly combining weighted weak classifiers has been
revisited [39] under the framework of information geometry. Another
striking example is the study of the geometry of dependence and Gaussianity for
Independent Component Analysis [15].